This vignette demonstrates the skin PPG workflow. The workflow starts from .dcc files, a .pkc file, and an annotation file, and works through preprocessing and exploratory analysis. The code folder in the main directory of this repository contains the .R files corresponding to each analysis step. This pipeline is a work in progress, and more analysis steps (for example, deconvolution) will be added. Below is a demonstration of the current steps of the pipeline.
In the following sections of this vignette, we walk through each substep of the analysis to thoroughly explain what the code is doing. Currently, the pipeline has more functionality than necessary, and some substeps can be turned off and on as needed.
Before we begin, we need to load the packages required for this workflow.
Code
################################################################################
## The purpose of this script is to load and install the required packages
## for the workflow
################################################################################
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# The following initializes the most up to date version of Bioc
# BiocManager::install(version = "3.15")
# BiocManager::install("NanoStringNCTools")
# BiocManager::install("GeomxTools")
# BiocManager::install("GeoMxWorkflows")

# For preprocessing
library(NanoStringNCTools)
library(GeomxTools)
library(GeoMxWorkflows)
library(knitr)
library(ggplot2)
library(ggforce)
library(dplyr)
library(scales)   # for percent
library(reshape2)
library(cowplot)

# For batch correction
library(sva)

# For dimension reduction
library(umap)
library(Rtsne)
library(pheatmap) # for pheatmap

# For modeling
library(geepack)

# Source every helper script in ../R (source() takes one file at a time)
invisible(sapply(paste0("../R/", dir("../R")), source))
QC and Normalization
The quality control and normalization section starts with the raw data files (.dcc, pkc, and annotation files) and works through the following substeps:
Data loading
Quality control
Segment QC (Default = ON)
Probe QC (Default = ON)
Limit of quantification (LOQ) (Default = OFF)
Normalization
Quantile 3 (Q3) normalization
Background normalization
Transcripts per million (TPM) (default = ON)
Log transformation (Default = ON, base=2)
Batch correction (Default = ON)
We demonstrate the order and structure of these steps in the diagram below. Please note that in this diagram the arrows do not represent dependencies but rather show the order in which the data is processed. For example, we do not need to run segment QC in order to run probe QC. However, if both segment QC and probe QC are turned "ON", segment QC will be run first.
QC and Normalization Workflow
To see the detailed information about segment QC, probe QC, LOQ, and batch correction, click the drop downs below.
Segment QC
Every ROI/AOI segment is tested for:
Raw sequencing reads: segments with <1000 raw reads are removed.
% Aligned, % Trimmed, or % Stitched sequencing reads: segments below ~80% for one or more of these QC parameters are removed.
% Sequencing saturation ([1-deduplicated reads/aligned reads]%): segments below ~50% require additional sequencing to capture full sample diversity and are not typically analyzed until improved.
Negative Count: this is the geometric mean of the several unique negative probes in the GeoMx panel that do not target mRNA and establish the background count level per segment; segments with low negative counts (1-10) are not necessarily removed but may be studied closer for low endogenous gene signal and/or insufficient tissue sampling.
No Template Control (NTC) count: values >1,000 could indicate contamination for the segments associated with this NTC; however, in cases where the NTC count is between 1,000- 10,000, the segments may be used if the NTC data is uniformly low (e.g. 0-2 counts for all probes).
Nuclei: >100 nuclei per segment is generally recommended; however, this cutoff is highly study/tissue dependent and may need to be reduced; what is most important is consistency in the nuclei distribution for segments within the study.
Area: generally correlates with nuclei; a strict cutoff is not generally applied based on area.
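For instance, the sequencing-saturation rule above can be computed directly from the bracketed formula; the read counts below are hypothetical, for illustration only:

```r
# Sequencing saturation: [1 - deduplicated reads / aligned reads] * 100
# Hypothetical per-segment read counts for illustration only.
aligned_reads      <- c(seg1 = 2e6, seg2 = 1.5e6, seg3 = 8e5)
deduplicated_reads <- c(seg1 = 6e5, seg2 = 9e5,   seg3 = 6e5)

saturation <- (1 - deduplicated_reads / aligned_reads) * 100
round(saturation, 1)

# Flag segments below the ~50% default threshold
flagged <- saturation < 50
names(saturation)[flagged]
```

Here seg1 (70%) passes, while seg2 (40%) and seg3 (25%) would be held back for additional sequencing.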
Default Values Of Segment QC Metrics

Metric                                      Default value
Aligned, trimmed, and stitched percentage   80%
Sequencing saturation                       50%
Negative count                              1
No template control                         9000
Min. nuclei                                 20
Min. area                                   1000
Probe QC
We will remove low-performing probes before summarizing our data into gene-level count data. In short, this QC is an outlier removal process whereby we remove probes in one of two ways.
Entirely from the study (global)
From specific segments (local).
A probe is removed globally from the dataset if either of the following is true:
The geometric mean of that probe's counts from all samples divided by the geometric mean of all probe counts representing the target from all samples is less than 0.1 (this only affects the negative control probes).
The probe is an outlier according to the Grubbs test in at least 20% of the segments.
We remove a probe locally (from a given segment) if the probe is an outlier according to the Grubbs test in that segment.
Explanation of the Grubbs test
The NanoString GitHub shows that the Grubbs test is performed on the log base-10 scale for all probe values within a sample. Below is the Grubbs test procedure they followed.
From their code, they run the Grubbs test on each sample and return the probe name with the maximum value if the Grubbs test rejected the null hypothesis.
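A minimal base-R sketch of a single two-sided Grubbs iteration on log10 counts may help make this concrete. The critical value uses the standard Grubbs formula; this is illustrative only, not the GeomxTools implementation, and the probe counts are made up:

```r
# One iteration of the (two-sided) Grubbs outlier test on log10 probe counts.
# Illustrative sketch only; GeomxTools' setBioProbeQCFlags() handles this internally.
grubbs_once <- function(x, alpha = 0.01) {
  n <- length(x)
  g <- max(abs(x - mean(x))) / sd(x)        # Grubbs statistic
  tcrit <- qt(alpha / (2 * n), df = n - 2)  # t critical value
  gcrit <- ((n - 1) / sqrt(n)) * sqrt(tcrit^2 / (n - 2 + tcrit^2))
  list(statistic = g, critical = gcrit, outlier = g > gcrit,
       which = which.max(abs(x - mean(x))))
}

# Hypothetical probe counts for one target in one segment
counts <- c(12, 15, 11, 14, 13, 250)  # the last probe looks aberrant
res <- grubbs_once(log10(counts))
res$outlier  # probe 6 is flagged
```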
Limit of quantification (LOQ)
In addition to segment and probe QC, we also determine the limit of quantification (LOQ) per segment. The LOQ is calculated from the distribution of negative control probes and is intended to approximate the quantifiable limit of gene expression per segment. Please note that this process is more stable in larger segments. Likewise, the LOQ may not accurately reflect true signal detection rates in segments with low negative probe counts (e.g., <2). The formula for calculating the LOQ in the \(i^{th}\) segment is:

\[ LOQ_i = \mathrm{geomean}(NegProbe_i) \times \mathrm{geoSD}(NegProbe_i)^{\,n} \]
We typically use 2 geometric standard deviations (\(n = 2\)) above the geometric mean as the LOQ, which is reasonable for most studies. We also recommend that a minimum LOQ of 2 be used if the LOQ calculated in a segment is below this threshold.
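A base-R sketch of this calculation with \(n = 2\) and a minimum LOQ of 2 (the negative probe counts below are made up):

```r
# LOQ per segment: geometric mean of the negative probes times the geometric
# standard deviation raised to the n-th power (n = 2), floored at minLOQ = 2.
neg_counts <- c(3, 5, 2, 4, 6, 3)  # hypothetical negative probe counts in one segment
n <- 2
minLOQ <- 2

geo_mean <- exp(mean(log(neg_counts)))  # geometric mean
geo_sd   <- exp(sd(log(neg_counts)))    # geometric standard deviation

loq <- max(minLOQ, geo_mean * geo_sd^n)
loq  # ~8 for these counts
```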
Filtering using LOQ
After determining the limit of quantification (LOQ) per segment, NanoString recommends filtering out segments and/or genes with abnormally low signal.
NanoString determines the number of genes detected in each segment across the dataset. We plot the number of segments by the percentage of genes that fall above the LOQ threshold, allowing us to see the impact of filtering. Currently, the code filters out all segments with less than 1% of genes above the LOQ threshold. Additionally, any gene not above the LOQ threshold in at least 10% of segments is filtered out.
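A toy base-R sketch of this two-way filtering; the count matrix, per-segment LOQ values, and 10% thresholds here are illustrative (the pipeline's actual cutoffs are set in the QC script):

```r
# Gene detection rate filtering, sketched on a toy count matrix.
# Rows are genes, columns are segments; loq holds one LOQ value per segment.
counts <- matrix(c(10,  1, 12,  9,
                    1,  2,  1,  1,
                   30, 25, 40, 22),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 paste0("seg", 1:4)))
loq <- c(seg1 = 2, seg2 = 2, seg3 = 3, seg4 = 2)

# TRUE where a gene is above the LOQ in a segment
detected <- sweep(counts, 2, loq, FUN = ">")

# Per-segment: fraction of genes detected; per-gene: fraction of segments
segment_rate <- colMeans(detected)
gene_rate    <- rowMeans(detected)

# Illustrative thresholds: keep segments with >= 10% of genes detected,
# then genes detected in >= 10% of the remaining segments
keep_seg  <- segment_rate >= 0.10
keep_gene <- rowMeans(detected[, keep_seg, drop = FALSE]) >= 0.10
filtered  <- counts[keep_gene, keep_seg, drop = FALSE]
```

In this toy example, geneB never rises above the LOQ and is dropped, while all four segments survive.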
Batch correction
ComBat allows users to adjust for batch effects in datasets where the batch covariate is known, using the methodology described in Johnson et al. 2007. It uses parametric or non-parametric empirical Bayes frameworks to adjust data for batch effects. By default, this method uses a parametric adjustment. The method returns an expression matrix with batch effects corrected. The input data are assumed to be cleaned and normalized before batch effect removal.
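ComBat itself lives in the sva package loaded above. As a conceptual illustration only, the base-R sketch below performs the simplest possible location-only adjustment, recentering each batch per gene; ComBat additionally shrinks per-gene batch means and variances with empirical Bayes, so this is a deliberately simplified stand-in on simulated data:

```r
# A deliberately simplified, location-only batch adjustment in base R.
# ComBat (sva) additionally applies empirical-Bayes shrinkage of per-gene
# batch means/variances; this sketch only recenters each batch per gene.
set.seed(1)
expr  <- matrix(rnorm(40), nrow = 4,
                dimnames = list(paste0("gene", 1:4), paste0("s", 1:10)))
batch <- rep(c("A", "B"), each = 5)
expr[, batch == "B"] <- expr[, batch == "B"] + 2  # inject a batch shift

center_batches <- function(mat, batch) {
  grand <- rowMeans(mat)
  for (b in unique(batch)) {
    idx <- batch == b
    mat[, idx] <- mat[, idx] - rowMeans(mat[, idx, drop = FALSE]) + grand
  }
  mat
}
adjusted <- center_batches(expr, batch)

# After adjustment the per-gene batch means coincide
rowMeans(adjusted[, batch == "A"]) - rowMeans(adjusted[, batch == "B"])
```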
Where to edit code for QC and Normalization.
All the places you need to edit the code are at the top of the script. You will need to edit
Which QC parts to use
Segment QC (SegmentQC)
Probe QC (ProbeQC)
Limit of quantification (LOQ)
Normalization method to use
The options are
Quantile normalization (q_norm)
Background normalization (b_norm)
Transcripts per million (TPM)
Log transformation (log_transformation)
If you set log_transformation = TRUE, then you must provide a base to use.
Batch effect (batchEff)
Once you have identified the parts to include in the analysis, you will need to provide the variable names in the annotation file that contain the following information:
ROI
AOI
Batch (only include if batchEff=TRUE)
DCC names (DCC_col)
This is a variable that links the metadata to the .dcc files.
Finally, provide paths to the folders that contain the .dcc files, the pkc file, and the annotation excel file.
Code
################################################################################
## This script is used for the following steps
##  1. Data loading (.DCC files, PKC files and sample annotation)
##  2. Quality Control
##     - SegmentQC (default = ON)
##     - ProbeQC (default = ON)
##     - Limit of Quantification QC (LOQ) (default = OFF)
##  3. Normalization
##     - Q3 Normalization
##     - Background Normalization
##     - Transcripts per million (default = ON)
##  4. Log transformation (default = ON, base = 2)
##  5. Batch correction (default = ON)
################################################################################

# Which QC metrics would you like to use?
SegmentQC = TRUE
ProbeQC   = TRUE
LOQ       = FALSE

# Which normalization would you like to use?
# Options are q_norm (Q3 normalization), b_norm (background normalization)
# or TPM (transcripts per million)
norm_method = "TPM"

# Do you want to use a log transformation? If so, what base do you want to use?
log_transformation = TRUE
base = 2

# Do you want to include batch effect adjustment?
batchEff = TRUE

# What are the ROI and AOI names in the annotation file
ROI = "region"
AOI = "SegmentLabel"
# Batch is only required if batchEff = TRUE
batch = "Experiment"
DCC_col = "Sample_ID"

################################################################################
## Step 1: Data loading
##   DCC files (folder), PKC files (folder)
##   and sample annotation (folder)
################################################################################

# Provide path to the DCC files, PKC file and sample annotation
DCC = "../Data/Raw/DCC"
PKC = "../Data/Raw/PKC"
SampleAnnotation = "../Data/Raw/Annotation"
DCC_col = "Sample_ID"

########################################
## Get all paths for the raw data
########################################
# DCC
DCCFiles <- dir(DCC, pattern = ".dcc$", full.names = TRUE, recursive = TRUE)
# PKC
PKCFiles <- dir(PKC, pattern = ".pkc$", full.names = TRUE, recursive = TRUE)
# Sample annotation (alternatively you could provide a direct path)
SampleAnnotationFile <- dir(SampleAnnotation, pattern = ".xlsx$",
                            full.names = TRUE, recursive = TRUE)

##########################################
## Load all data into the NanoString
## object to store all data together.
##########################################
# 1. Load data (this may take a few moments)
demoData <- readNanoStringGeoMxSet(dccFiles = DCCFiles,
                                   pkcFiles = PKCFiles,
                                   phenoDataFile = SampleAnnotationFile,
                                   phenoDataSheet = "Sheet1",
                                   phenoDataDccColName = DCC_col)

################################################################################
## Step 2: Quality Control
## Consists of the following subparts
##   - Count shift
##   - Segment QC
##   - Probe QC
##   - LOQ (limit of quantification filtering)
################################################################################

#############################################################
## Shift counts to one
#############################################################
demoData <- shiftCountsOne(demoData, useDALogic = TRUE)

#############################################################
## Segment QC
#############################################################
if (SegmentQC == TRUE) {
  # Set the QC parameters that we wish to use. These can be edited.
  QC_params <- list(minSegmentReads = 1000, # Minimum number of reads (1000)
                    percentTrimmed = 80,    # Minimum % of reads trimmed (80%)
                    percentStitched = 80,   # Minimum % of reads stitched (80%)
                    percentAligned = 80,    # Minimum % of reads aligned (80%)
                    percentSaturation = 50, # Minimum sequencing saturation (50%)
                    minNegativeCount = 1,   # Minimum negative control counts (1)
                    maxNTCCount = 9000,     # Maximum counts observed in NTC well (9000)
                    minNuclei = 20,         # Minimum # of nuclei estimated (20)
                    minArea = 1000)         # Minimum segment area (1000)

  # Include the QC parameters in the NanoString object
  demoData <- setSegmentQCFlags(demoData, qcCutoffs = QC_params)

  ###############################################
  ## Create QC summary table based on
  ## these flags.
  ###############################################
  QCResults <- protocolData(demoData)[["QCFlags"]]
  flag_columns <- colnames(QCResults)
  QC_Summary <- data.frame(Pass = colSums(!QCResults[, flag_columns]),
                           Warning = colSums(QCResults[, flag_columns]))
  QCResults$QCStatus <- apply(QCResults, 1L, function(x) {
    ifelse(sum(x) == 0L, "PASS", "WARNING")
  })

  # Display summary of the results
  # QC_Summary["TOTAL FLAGS", ] <- c(sum(QCResults[, "QCStatus"] == "PASS"),
  #                                  sum(QCResults[, "QCStatus"] == "WARNING"))
  # print(knitr::kable(QC_Summary))

  ###############################################
  ## Visualize segment QC
  ###############################################
  col_by <- AOI
  # Trimmed %
  print(QC_histogram(sData(demoData), "Trimmed (%)", col_by, QC_params$percentTrimmed))
  # Stitched %
  print(QC_histogram(sData(demoData), "Stitched (%)", col_by, QC_params$percentStitched))
  # Aligned %
  print(QC_histogram(sData(demoData), "Aligned (%)", col_by, QC_params$percentAligned))
  # Saturated %
  print(QC_histogram(sData(demoData), "Saturated (%)", col_by, QC_params$percentSaturation) +
          labs(title = "Sequencing Saturation (%)",
               x = "Sequencing Saturation (%)"))
  # Area
  print(QC_histogram(sData(demoData), "AOISurfaceArea", col_by, QC_params$minArea,
                     scale_trans = "log10"))
  # Nuclei
  print(QC_histogram(sData(demoData), "AOINucleiCount", col_by, QC_params$minNuclei))

  ###############################################
  ## Calculate and plot geometric mean of the
  ## negative probes for each segment.
  ###############################################
  # Assign the pkcs variable to the name of the gene annotation file
  pkcs <- annotation(demoData)
  # Create table with gene annotation file names
  modules <- gsub(".pkc", "", pkcs)
  # Calculate the geometric means of the negative controls for each of the samples
  negativeGeoMeans <- esBy(negativeControlSubset(demoData),
                           GROUP = "Module",
                           FUN = function(x) {
                             assayDataApply(x, MARGIN = 2, FUN = ngeoMean, elt = "exprs")
                           })
  # Store the geometric means in the protocol (metadata)
  protocolData(demoData)[["NegGeoMean"]] <- negativeGeoMeans
  # Explicitly copy the negative geoMeans from sData to pData
  negCols <- paste0("NegGeoMean_", modules)
  pData(demoData)[, negCols] <- sData(demoData)[["NegGeoMean"]]
  # Plot the geometric means for each sample
  for (ann in negCols) {
    plt <- QC_histogram(pData(demoData), ann, col_by, QC_params$minNegativeCount,
                        scale_trans = "log10")
    print(plt)
  }

  ###############################
  ## Display segment QC results
  ## and filter segments
  ###############################
  # Segment QC summary
  print(kable(QC_Summary, caption = "QC Summary Table for each Segment"))
  # Subset data only on the samples that passed the QC metrics
  demoData <- demoData[, QCResults$QCStatus == "PASS"]
}
QC Summary Table for each Segment

                Pass   Warning
LowReads        136    0
LowTrimmed      136    0
LowStitched     136    0
LowAligned      136    0
LowSaturation   136    0
LowNegatives    136    0
Code
#######################################################################
## Probe QC
#######################################################################
if (ProbeQC == TRUE) {
  # Set flag values
  demoData <- setBioProbeQCFlags(demoData,
                                 qcCutoffs = list(minProbeRatio = 0.1,
                                                  percentFailGrubbs = 20),
                                 removeLocalOutliers = TRUE)

  # 2. Data set with raised flags for each probe
  ProbeQCResults <- fData(demoData)[["QCFlags"]]

  # 3. Table with flag summary count
  qc_df <- data.frame(Passed = sum(rowSums(ProbeQCResults[, -1]) == 0),
                      Global = sum(ProbeQCResults$GlobalGrubbsOutlier),
                      Local = sum(rowSums(ProbeQCResults[, -2:-1]) > 0 &
                                    !ProbeQCResults$GlobalGrubbsOutlier))
  print(kable(qc_df, caption = "Probes flagged or passed as outliers"))

  ###################################
  ## Filter probes
  ###################################
  # Subset object to exclude all that did not pass ratio & global testing
  ProbeQCPassed <- subset(demoData,
                          fData(demoData)[["QCFlags"]][, c("LowProbeRatio")] == FALSE &
                            fData(demoData)[["QCFlags"]][, c("GlobalGrubbsOutlier")] == FALSE)
  # Reassign to demoData
  demoData <- ProbeQCPassed
}
Probes flagged or passed as outliers

Passed   Global   Local
18796    0        19
Code
###################################
## Create gene level count matrix
###################################
# Aggregate counts with the geometric mean
target_demoData <- aggregateCounts(demoData)

################################################################
## Limit of Quantification (LOQ)
################################################################
if (LOQ == TRUE) {
  ################################
  ## Calculate LOQ
  ################################
  # Define the LOQ cutoff and the minimum LOQ
  cutoff <- 2
  minLOQ <- 2

  # Calculate LOQ per module tested
  LOQ <- data.frame(row.names = colnames(target_demoData))
  for (module in modules) {
    vars <- paste0(c("NegGeoMean_", "NegGeoSD_"), module)
    if (all(vars[1:2] %in% colnames(pData(target_demoData)))) {
      LOQ[, module] <- pmax(minLOQ,
                            pData(target_demoData)[, vars[1]] *
                              pData(target_demoData)[, vars[2]] ^ cutoff)
    }
  }
  # Store the LOQ information
  pData(target_demoData)$LOQ <- LOQ

  ######################################
  ## Create matrix with LOQ
  ## pass/fail information
  ######################################
  # Initialize an empty vector to store the LOQ_Mat values
  LOQ_Mat <- c()
  for (module in modules) {
    ind <- fData(target_demoData)$Module == module
    # Check if each value is greater than the corresponding LOQ value
    Mat_i <- t(esApply(target_demoData[ind, ], MARGIN = 1,
                       FUN = function(x) {
                         x > LOQ[, module]
                       }))
    LOQ_Mat <- rbind(LOQ_Mat, Mat_i)
  }
  # Ensure ordering since this is stored outside of the geomxSet
  LOQ_Mat <- LOQ_Mat[fData(target_demoData)$TargetName, ]

  ######################################
  ## Segment gene detection rate plot
  ######################################
  # 1. Save detection rate information to pheno data
  pData(target_demoData)$GenesDetected <- colSums(LOQ_Mat, na.rm = TRUE)
  pData(target_demoData)$GeneDetectionRate <-
    pData(target_demoData)$GenesDetected / nrow(target_demoData)
  # 2. Classify each sample based on the percentage of genes above the LOQ
  pData(target_demoData)$DetectionThreshold <-
    cut(pData(target_demoData)$GeneDetectionRate,
        breaks = c(0, 0.01, 0.05, 0.1, 0.15, 1),
        labels = c("<1%", "1-5%", "5-10%", "10-15%", ">15%"))
  # 3. Create a barplot of the different classes (1%, 5%, 10%, 15%)
  geneplot <- ggplot(pData(target_demoData),
                     aes(x = DetectionThreshold)) +
    geom_bar(aes(fill = get(ROI))) +
    geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
    theme_bw() +
    scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
    labs(x = "Gene Detection Rate",
         y = "Segments, #",
         fill = "Segment Type",
         title = "Number of Samples in Gene Detection Rate Samples")
  print(geneplot)

  #########################################
  ## Filter segments based on
  ## gene detection rate
  #########################################
  target_demoData <-
    target_demoData[, pData(target_demoData)$GeneDetectionRate >= .01]

  ########################################
  ## Gene detection rate plot
  ########################################
  # Filter LOQ matrix
  LOQ_Mat <- LOQ_Mat[, colnames(target_demoData)]
  # Calculate the number of segments in which each gene is detected
  fData(target_demoData)$DetectedSegments <- rowSums(LOQ_Mat, na.rm = TRUE)
  # Save the detection rate for each gene
  fData(target_demoData)$DetectionRate <-
    fData(target_demoData)$DetectedSegments / nrow(pData(target_demoData))
  # Create a data frame for plotting with a frequency column
  plot_detect <- data.frame(Freq = c(1, 5, 10, 20, 30, 50))
  # Calculate number of genes above each threshold
  plot_detect$Number <-
    unlist(lapply(c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5),
                  function(x) { sum(fData(target_demoData)$DetectionRate >= x) }))
  # Calculate the percentage of genes above this threshold
  plot_detect$Rate <- plot_detect$Number / nrow(fData(target_demoData))
  rownames(plot_detect) <- plot_detect$Freq
  # Plot gene detection rate
  print(
    ggplot(plot_detect, aes(x = as.factor(Freq), y = Rate, fill = Rate)) +
      geom_bar(stat = "identity") +
      geom_text(aes(label = formatC(Number, format = "d", big.mark = ",")),
                vjust = 1.6, color = "black", size = 4) +
      scale_fill_gradient2(low = "orange2", mid = "lightblue",
                           high = "dodgerblue3", midpoint = 0.65,
                           limits = c(0, 1),
                           labels = scales::percent) +
      theme_bw() +
      scale_y_continuous(labels = scales::percent, limits = c(0, 1),
                         expand = expansion(mult = c(0, 0))) +
      labs(x = "% of Segments",
           y = "Genes Detected, % of Panel > LOQ")
  )
  # Save plot
  # ggsave("../Outputs/Plots/LOQGeneDetectionRatePlot.png")

  ##########################################
  ## Filter genes based on gene detection
  ## rate
  ##########################################
  # Keep the negative control probes
  negativeProbefData <- subset(fData(target_demoData), CodeClass == "Negative")
  neg_probes <- unique(negativeProbefData$TargetName)
  # Subset to genes detected in at least 10% of the segments
  target_demoData <-
    target_demoData[fData(target_demoData)$DetectionRate >= 0.1 |
                      fData(target_demoData)$TargetName %in% neg_probes, ]
}

################################################################################
## Step 3: Normalization
## There are three types of normalization possible
##   - Q3 normalization
##   - Background normalization
##   - Transcripts per million (default)
################################################################################

##################################
## Q3 Normalization
##################################
if (norm_method == "q_norm") {
  # Q3 norm (75th percentile) for WTA/CTA with or without custom spike-ins
  target_demoData <- normalize(target_demoData,
                               norm_method = "quant",
                               desiredQuantile = .75,
                               toElt = norm_method)
}

###################################
## Background normalization
###################################
if (norm_method == "b_norm") {
  # Background normalization for WTA/CTA without custom spike-in
  target_demoData <- normalize(target_demoData,
                               norm_method = "neg",
                               fromElt = "exprs",
                               toElt = norm_method)
}

#############################################
## Transcripts per-million standardization
#############################################
if (norm_method == "TPM") {
  TPM_norm = apply(target_demoData@assayData$exprs, 2,
                   FUN = function(x) { (x / sum(x)) * 1000000 })
  # Add assay data
  assayDataElement(target_demoData, "TPM") = TPM_norm
}

##############################################
## Log transformation
##############################################
if (log_transformation == TRUE) {
  # Log transform assay data
  assayLogT = log(target_demoData@assayData[[norm_method]], base = base)
  # Add assay data to the object
  assayDataElement(target_demoData, paste0("logT-base", base)) = assayLogT
}

################################################################################
## Batch effects (here batch info is captured in the "Experiment" variable).
## There are two cases because if we used a log transformation then
## we want to use the log transformation assay slot.
################################################################################
if (batchEff == TRUE) {
  # If we didn't log transform then we want to use the norm_method slot
  if (log_transformation == FALSE) {
    # Do batch correction
    adjusted_expression <- ComBat(dat = target_demoData@assayData[[norm_method]],
                                  batch = target_demoData@phenoData@data$Experiment,
                                  mod = NULL, prior.plots = FALSE)
    # Save batch corrected results in the NanoString object
    assayDataElement(target_demoData, "batch_corrected") = adjusted_expression
  }
  # If log transformed, we want to use the log transformation assay slot
  if (log_transformation == TRUE) {
    # Do batch correction
    adjusted_expression <- ComBat(dat = target_demoData@assayData[[paste0("logT-base", base)]],
                                  batch = target_demoData@phenoData@data$Experiment,
                                  mod = NULL, prior.plots = FALSE)
    # Save batch corrected results in the NanoString object
    assayDataElement(target_demoData, "batch_corrected") = adjusted_expression
  }
}

####################################
## Save data
####################################
save(target_demoData, file = "../Data/Processed/NormalizedData.Rdata")
Exploratory analysis
The exploratory analysis allows us to visualize our data to see high-level patterns without requiring a formal hypothesis. Additionally, the exploratory analysis can serve as a quality control tool. For example, when we do dimensionality reduction, we may notice that most of the variance in the data is coming from the different batches, and we may want to go back to preprocessing and do batch correction before continuing with the downstream analysis. In this exploratory analysis, we are using:
Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction.
Heatmaps
Technical Information About UMAP
Uniform manifold approximation and projection (UMAP) is a non-linear dimensionality reduction method rooted in topology. The goal of UMAP is to find a low-dimensional representation that best preserves the local structure of the high-dimensional space. The original method was published in 2018 and is a staple in scRNA-seq pipelines. The mathematical details for UMAP are complex; however, the authors of UMAP provide a high-level introduction on their website. Becht et al. show the advantages of UMAP for visualizing scRNA-seq data over traditional methods like PCA and t-SNE.
Where to edit code for exploratory analysis.
First, you must turn the UMAP and heatmap sections off or on. By default, both are turned on; to turn them off, set UMAP = FALSE and/or Heatmap = FALSE. Then, provide a path to the data object, which is the output of the QC and normalization step. Since it is possible to have multiple expression matrices after QC and normalization, you need to specify which one to use. You can find the expression data options using target_demoData@assayData. By default, we use the batch-corrected data (elt = "batch_corrected"). Finally, for the exploratory analysis, you must provide the names of the ROI, AOI, slide name, and batch variables. If you are not using batch correction, set batch = NULL.
Code
################################################################################
## Exploratory analysis
## This consists of two parts
##   - Dimensionality reduction
##   - Heatmaps
################################################################################

# Which exploratory analysis sections would you like to include?
UMAP = TRUE
Heatmap = TRUE

# What is the path to the processed data?
data_path = "../Data/Processed/NormalizedData.Rdata"

# Which assay to use? To see options use: target_demoData@assayData
elt = "batch_corrected"

# Provide the variable names for ROI, AOI, batch, and slide names
ROI = "region"
AOI = "SegmentLabel"
batch = "Experiment"
slideName = "SlideName"

#######################################
## Load processed data
#######################################
# You may need to edit this code to update the path
load(data_path)

######################################
## Sample distribution
######################################
knitr::kable(target_demoData@phenoData@data %>%
               select(as.name(slideName), as.name(AOI), as.name(ROI)) %>%
               group_by(across(all_of(c(slideName, ROI, AOI)))) %>%
               summarise(N = n()) %>%
               dcast(formula(paste0(slideName, "~", ROI, "+", AOI))),
             caption = "Distribution of samples for each slide across ROI and AOI") %>%
  kableExtra::kable_paper()
`summarise()` has grouped output by 'SlideName', 'region'. You can override
using the `.groups` argument.
Using N as value column: use value.var to override.
Distribution of samples for each slide across ROI and AOI

SlideName                 AK Center_panckNeg  AK Center_panckPos  AK Edge_panckNeg  AK Edge_panckPos  Sun Protected_panckNeg  Sun Protected_panckPos
007                       3                   3                   2                 2                 5                       5
015                       3                   3                   3                 3                 5                       5
080                       3                   3                   3                 3                 5                       6
086                       3                   3                   3                 3                 5                       6
SAZT20088-1 SATZ20088-3   3                   3                   3                 3                 6                       6
SAZT20099-1 SAZT20099-3   3                   3                   3                 3                 6                       6
Code
######################################
## UMAP
######################################
if (UMAP == TRUE) {
  # Update defaults for umap to contain a stable random_state (seed)
  custom_umap <- umap::umap.defaults
  custom_umap$random_state <- 42
  # Run UMAP (otherwise uses default umap settings)
  umap_out <- umap(t(assayDataElement(target_demoData, elt = elt)),
                   config = custom_umap)
  # Save the results from the UMAP in the data object
  pData(target_demoData)[, c("UMAP1", "UMAP2")] <- umap_out$layout[, c(1, 2)]
  if (is.null(batch)) {
    # Plot using ggplot
    ggplot(pData(target_demoData),
           aes(x = UMAP1, y = UMAP2, color = !!as.name(ROI))) +
      geom_point(size = 3) +
      theme_bw() +
      ggtitle("Dim Reduction Using UMAP")
  } else {
    # Plot using ggplot, with batch shown as point shape
    ggplot(pData(target_demoData),
           aes(x = UMAP1, y = UMAP2, color = !!as.name(ROI),
               shape = !!as.name(batch))) +
      geom_point(size = 3) +
      theme_bw() +
      ggtitle("Dim Reduction Using UMAP")
  }
}
Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
Also defined by 'spam'
Code
######################################
## Heatmap
######################################
if (Heatmap == TRUE) {
  # Get the expression matrix for the chosen assay slot
  mat <- target_demoData@assayData[[elt]]
  # Rank genes by variance and keep the top 50
  var_genes <- apply(mat, 1, var)
  top50Genes <- var_genes[order(var_genes, decreasing = TRUE)[1:50]]
  annot <- target_demoData@phenoData@data %>%
    select(all_of(c(ROI, AOI, slideName))) %>%
    arrange(!!as.name(slideName), !!as.name(AOI), !!as.name(ROI))
  pheatmap(mat[names(top50Genes), rownames(annot)],
           cluster_rows = FALSE, cluster_cols = FALSE,
           annotation_col = annot,
           show_rownames = FALSE, show_colnames = FALSE,
           main = "Top 50 variable genes (log2-transformed) gene expression",
           border_color = "black")
}
Trajectory analysis
The goal of this analysis is to find genes with either an upward or downward trajectory from normal skin to AK edge to AK center. In this example, we assign numeric values to the normal, AK edge, and AK center regions of interest.
ROI         Numeric value
Normal      1
AK Edge     2
AK Center   3
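This assignment can be sketched directly in base R; note that in the annotation file the normal region is labeled "Sun Protected", which is the label the modeling code maps to 1:

```r
# Map each region of interest to its ordinal trajectory value.
# In the annotation file the normal region is labeled "Sun Protected".
region_score <- c("Sun Protected" = 1, "AK Edge" = 2, "AK Center" = 3)

# Example region labels as they would appear in the annotation data
regions <- c("AK Center", "Sun Protected", "AK Edge", "AK Edge")
regionT <- unname(region_score[regions])
regionT  # 3 1 2 2
```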
In this example, the analysis is stratified by AOI (CK+ and CK-). We model each gene using generalized estimating equations (GEE) with the subject as the grouping variable. GEEs require us to specify a working correlation structure; here we use the exchangeable working correlation. We give an example of this modeling strategy for the A2M gene.
GEE modeling is used for each gene, and we focus on genes that have a significant trajectory from normal to AK center. We summarize the number of significant genes, the proportion of significant genes, and the number of genes with upward and downward trajectories.
We then show heatmaps focusing on the genes with upward/downward trajectories for each AOI. The values used in the heatmaps are standardized to a standard normal distribution so that all the trajectories can be viewed on the same scale. From these heatmaps we may be able to identify a few genes to examine more closely. For those, we plot the trajectories and the 95% confidence intervals of the trajectories for the specific genes.
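The pipeline fits geepack::geeglm with an exchangeable working correlation. As a self-contained stand-in, the sketch below fits ordinary least squares (lm) on simulated data; it estimates the same regionT slope ("trajectory") but ignores within-subject correlation, and the effect size and noise level are made up:

```r
# Sketch of the per-gene trajectory fit on simulated data. The pipeline uses
# geepack::geeglm() with an exchangeable working correlation; ordinary least
# squares stands in here so the example runs with base R only. The slope on
# regionT is the "trajectory", tested against zero.
set.seed(42)
n_subj <- 6
dat <- expand.grid(subject = factor(1:n_subj), regionT = 1:3)
dat$expr <- 0.8 * dat$regionT + rnorm(nrow(dat), sd = 0.3)  # simulated upward trend

fit <- lm(expr ~ regionT, data = dat)
slope  <- coef(fit)[["regionT"]]
pvalue <- summary(fit)$coefficients["regionT", "Pr(>|t|)"]

# Classify the trajectory the same way the summary tables do
direction <- if (pvalue < 0.05 && slope > 0) {
  "Upward"
} else if (pvalue < 0.05 && slope < 0) {
  "Downward"
} else {
  "Not significant"
}
```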
Code
################################################################################
## Trajectory analysis. Here we are testing for a trajectory from no disease
## to disease.
################################################################################

# What is the path to the processed data?
data_path = "../Data/Processed/NormalizedData.Rdata"

# Which assay to use? To see options use: target_demoData@assayData
# Here we are using the batch corrected, log-transformed data
elt = "batch_corrected"

# Provide the variable names for ROI, AOI, and slide names
ROI = "region"
AOI = "SegmentLabel"
slideName = "SlideName"
subject_id = "SlideName"

#######################################
## Load processed data
#######################################
# You may need to edit this code to update the path
load(data_path)

######################################
## Sampling distributions
######################################
target_demoData@phenoData@data %>%
  select(as.name(slideName), as.name(AOI), as.name(ROI)) %>%
  group_by(across(all_of(c(slideName, ROI, AOI)))) %>%
  summarise(N = n()) %>%
  dcast(formula(paste0(slideName, "~", ROI, "+", AOI))) %>%
  knitr::kable(col.names = c("SubjectID",
                             "AK Center (CK-)", "AK Center (CK+)",
                             "AK Edge (CK-)", "AK Edge (CK+)",
                             "Sun Protected (CK-)", "Sun Protected (CK+)"),
               caption = "Sampling distribution across subjects")
`summarise()` has grouped output by 'SlideName', 'region'. You can override
using the `.groups` argument.
Using N as value column: use value.var to override.
Sampling distribution across subjects

| SubjectID               | AK Center (CK-) | AK Center (CK+) | AK Edge (CK-) | AK Edge (CK+) | Sun Protected (CK-) | Sun Protected (CK+) |
|-------------------------|-----------------|-----------------|---------------|---------------|---------------------|---------------------|
| 007                     | 3               | 3               | 2             | 2             | 5                   | 5                   |
| 015                     | 3               | 3               | 3             | 3             | 5                   | 5                   |
| 080                     | 3               | 3               | 3             | 3             | 5                   | 6                   |
| 086                     | 3               | 3               | 3             | 3             | 5                   | 6                   |
| SAZT20088-1 SATZ20088-3 | 3               | 3               | 3             | 3             | 6                   | 6                   |
| SAZT20099-1 SAZT20099-3 | 3               | 3               | 3             | 3             | 6                   | 6                   |
Code
###############################################
# Create model data so gene expression
# and metadata are together in one dataset
###############################################

# Get the assay data (CPM) for the modeling
assay <- data.frame(t(assayDataElement(target_demoData, elt = elt)))

# Merge the metadata and the assay data into one dataframe and add the
# trajectory ordering to the regions.
model_dat <- merge(target_demoData@phenoData@data, assay, by = "row.names") %>%
  mutate(regionT = case_when(region == "Sun Protected" ~ 1,
                             region == "AK Edge"       ~ 2,
                             region == "AK Center"     ~ 3),
         subject_ID = as.factor(!!as.name(subject_id)))

# Split the data into CK+ and CK- segments
model_datPlus  <- model_dat %>% filter(SegmentLabel == "panckPos")
model_datMinus <- model_dat %>% filter(SegmentLabel == "panckNeg")

# Run trajectory analysis on one gene
geeCKPlus  <- geeglm(A2M ~ regionT, family = gaussian, id = subject_ID,
                     data = model_datPlus, corstr = "exchangeable")
geeCKMinus <- geeglm(A2M ~ regionT, family = gaussian, id = subject_ID,
                     data = model_datMinus, corstr = "exchangeable")

summary(geeCKPlus)
# Create a gee function to loop through all genes
trajectories <- function(dat, genes){
  # Function for modeling each gene
  geeF <- function(X){
    geeM <- geeglm(X ~ regionT, family = gaussian, id = subject_ID,
                   data = dat, corstr = "exchangeable")
    # Get summary of results
    sum_res <- summary(geeM)
    # Output beta coefficient, standard error, p-value, and intercept
    c("Trajectory" = geeM$coefficients[2],
      se           = sum_res$coefficients[2, 2],
      "pvalue"     = sum_res$coefficients[2, 4],
      "intercept"  = sum_res$coefficients[1, 1])
  }
  # Order data based on subject ID
  dat <- dat %>% arrange(subject_ID)
  # Run gene trajectories on the data
  traj <- apply(dat[, genes], MARGIN = 2, FUN = geeF)
  # Return the result
  return(t(traj))
}

#################################################################
# Run trajectory analysis
#################################################################
# Identify the genes we want to run it on
genes <- colnames(assay)

# CK+ trajectories
trajectories_plus  <- trajectories(model_datPlus,  genes = genes[1:1000])
# CK- trajectories
trajectories_minus <- trajectories(model_datMinus, genes = genes[1:1000])

#############################################################################
# Tables to summarise results
#############################################################################
# CK+ trajectories
as.data.frame(trajectories_plus) %>%
  mutate(
    # How many had a significant trajectory?
    sig = case_when(pvalue <  0.05 ~ "sig",
                    pvalue >= 0.05 ~ "NotSig"),
    # Upward or downward trajectory?
    traject = case_when(Trajectory.regionT < 0 & sig == "sig" ~ "Downward",
                        Trajectory.regionT > 0 & sig == "sig" ~ "Upward")
  ) %>%
  summarise(SigTrajectory = sum(sig == "sig"),
            sigProp  = SigTrajectory / n(),
            Upward   = sum(traject == "Upward",   na.rm = TRUE),
            Downward = sum(traject == "Downward", na.rm = TRUE)) %>%
  knitr::kable(col.names = c("Significant Genes", "Proportion Significant",
                             "Upward", "Downward"),
               caption = "Significant trajectories (CK+)")
Significant trajectories (CK+)

| Significant Genes | Proportion Significant | Upward | Downward |
|-------------------|------------------------|--------|----------|
| 257               | 0.257                  | 79     | 178      |
Code
# CK- trajectories
as.data.frame(trajectories_minus) %>%
  mutate(
    # How many had a significant trajectory?
    sig = case_when(pvalue <  0.05 ~ "sig",
                    pvalue >= 0.05 ~ "NotSig"),
    # Upward or downward trajectory?
    traject = case_when(Trajectory.regionT < 0 & sig == "sig" ~ "Downward",
                        Trajectory.regionT > 0 & sig == "sig" ~ "Upward")
  ) %>%
  summarise(SigTrajectory = sum(sig == "sig"),
            sigProp  = SigTrajectory / n(),
            Upward   = sum(traject == "Upward",   na.rm = TRUE),
            Downward = sum(traject == "Downward", na.rm = TRUE)) %>%
  knitr::kable(col.names = c("Significant Genes", "Proportion Significant",
                             "Upward", "Downward"),
               caption = "Significant trajectories (CK-)")
Significant trajectories (CK-)

| Significant Genes | Proportion Significant | Upward | Downward |
|-------------------|------------------------|--------|----------|
| 250               | 0.25                   | 136    | 114      |
Code
################################################################################
# Trajectory Heatmaps
################################################################################

##########################
# CK+ heatmap
##########################
# Find all significant genes with an upward trajectory
upgenes <- as.data.frame(trajectories_plus) %>%
  filter(pvalue < 0.05 & Trajectory.regionT > 0) %>%
  arrange(-Trajectory.regionT) %>%
  rownames()

# Find all significant genes with a downward trajectory
downgenes <- as.data.frame(trajectories_plus) %>%
  filter(pvalue < 0.05 & Trajectory.regionT < 0) %>%
  arrange(Trajectory.regionT) %>%
  rownames()

# Create matrices for the heatmaps, ordered by trajectory position
UpMat   <- model_datPlus %>% arrange(regionT)
downMat <- model_datPlus %>% arrange(regionT)

# Heatmap of the top 50 upward genes (columns centered and scaled)
pheatmap(t(scale(UpMat[, upgenes[1:50]], center = TRUE, scale = TRUE)),
         cluster_rows = FALSE, cluster_cols = FALSE,
         labels_col = UpMat$region,
         main = "Normalized Log-scaled CPM upward trajectories (CK-Plus)")
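The `scale()` call in the heatmap code converts each gene column to z-scores, which is what lets genes with very different expression magnitudes share one color scale. A minimal base-R check of that transformation, on a toy matrix:

```r
# scale(x, center = TRUE, scale = TRUE) standardizes each column to
# mean 0 and standard deviation 1 (z-scores), putting every gene on a
# comparable scale regardless of its raw magnitude.
m <- matrix(c(10, 20, 30, 40,        # "gene A": large values
              0.1, 0.2, 0.3, 0.4),   # "gene B": small values
            ncol = 2)
z <- scale(m, center = TRUE, scale = TRUE)

# After scaling, both columns have mean ~0 and sd 1, so a heatmap of z
# colors gene A and gene B identically despite the 100x raw difference.
colMeans(z)
apply(z, 2, sd)
```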